Abstract

Estimating the first moment of a data stream, defined as $F_1 = \sum_{i \in \{1, 2, \ldots, n\}} |f_i|$, to within $1 \pm \epsilon$ relative error with high probability is a basic and influential problem in data stream processing. A tight space bound of $O(\epsilon^{-2} \log(mM))$ is known from the work of [9]. However, all known algorithms for this problem require per-update stream processing time of $\Omega(\epsilon^{-2})$, with the only exception being the algorithm of [6] that requires per-update processing time of $O(\log^2(mM)(\log n))$, albeit with sub-optimal space $O(\epsilon^{-3} \log^2(mM))$.^1

In this paper, we present an algorithm for estimating $F_1$ that achieves near-optimality in both space and update processing time. The space requirement is $O(\epsilon^{-2}(\log n + (\log \epsilon^{-1}) \log(mM)))$ and the per-update processing time is $O((\log n) \log(\epsilon^{-1}))$.

1 Introduction

The data stream model serves as an abstraction for a variety of monitoring applications, including data networks, sensor networks and financial data. In this model, an input stream $\sigma$ is abstracted as a potentially infinite sequence of records of the form $(pos, i, v)$, where $i \in \{1, 2, \ldots, n\} = [n]$ and $v \in \mathbb{Z}$ is the change to the frequency $f_i$ of item $i$. The $pos$ attribute is simply the sequence number of the record. Each input record $(pos, i, v)$ changes $f_i$ to $f_i + v$. Thus, $f_i = \sum_{(pos, i, v) \in \sigma} v$, that is, $f_i$ is the sum of the changes made to the frequency of $i$ since the inception of the stream. The vector $f = [f_1, f_2, \ldots, f_n]^T$ is called the frequency vector of the stream.

The $p$th frequency moment is defined as $F_p = \sum_{i=1}^{n} |f_i|^p$. The problem of estimating $F_p$, and in particular the estimation of $F_0$, $F_1$ and $F_2$, has been fundamental to the development of data stream processing techniques and lower bounds. In this paper, we consider the problem of estimating $F_1$ to within an approximation factor of $1 \pm \epsilon$ and with probability at least some constant $c > 0.5$, where the probability is taken over the internal random bits used by the algorithm. We say that a randomized algorithm computes an $\epsilon$-approximation to a real-valued quantity $L$ provided it returns $\hat{L}$ such that $|\hat{L} - L| < \epsilon L$ with probability that is at least some absolute constant strictly larger than 1/2. Since prior work [1] shows that any deterministic algorithm for 0.1-approximation of $F_p$, $p \ge 0$, requires $\Omega(n)$ space, we consider the problem of randomized $\epsilon$-approximation of $F_1$.
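
As a point of reference for the definitions above, the following minimal sketch computes the frequency vector and $F_p$ exactly from a list of update records. It stores one counter per distinct item, so it is only a baseline against which a small-space estimator can be checked and not a streaming algorithm; the function names and the sample stream are illustrative assumptions.

```python
from collections import defaultdict


def frequency_vector(stream):
    """Accumulate f_i as the sum of all changes v applied to item i."""
    freq = defaultdict(int)
    for _pos, item, change in stream:
        freq[item] += change
    return freq


def frequency_moment(freq, p):
    """F_p = sum_i |f_i|^p; F_0 counts the items with nonzero frequency."""
    if p == 0:
        return sum(1 for f in freq.values() if f != 0)
    return sum(abs(f) ** p for f in freq.values())


# Hypothetical stream of (pos, i, v) records with insertions and deletions.
stream = [(1, 3, 5), (2, 7, -2), (3, 3, -1), (4, 9, 4), (5, 7, 2)]
freq = frequency_vector(stream)
```

Here item 7 is inserted and then cancelled, so it contributes to neither $F_0$ nor $F_1$.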



^1 At the IIT Kanpur Workshop on Algorithms for Massive Data Sets, Dec 18-20 2009, Jelani Nelson announced the discovery of an algorithm (with David Woodruff) for estimating $F_1$ that uses space $O(\epsilon^{-2} \log^{O(1)}(mM))$ and time $O(\log^{O(1)}(mM))$. Since their work is unpublished, we are unable to make a comparison.



We assume that items come from the domain $[n] = \{1, 2, \ldots, n\}$, that each stream update $(pos, i, v)$ has $|v| \le M$, and that the size of the stream, i.e., the number of records appearing in the stream, is $m$. The work of [1] presents a seminal randomized sketch technique for $\epsilon$-approximation of $F_2$ in the data streaming model using space $O(\epsilon^{-2} \log(mM))$ bits. Estimation of $F_0$ (i.e., the number of $i \in [n]$ s.t. $|f_i| \ne 0$) was first considered by Flajolet and Martin in [3] and improved in [1], [7], [2]. Since the techniques for estimating $F_p$ for $p > 2$ are substantially different from those used for estimating $F_p$ for $0 < p \le 2$, we do not review this line of work.

1.1 Review: Previous work on estimating small moments 

We now review existing work on estimating $F_p$, for $p \in (0, 2]$. In terms of lower bounds for estimating $F_p$, Woodruff [13] presents an $\Omega(\epsilon^{-2})$ space lower bound for the problem of estimating
$F_p$, for all $p \ge 0$. This is improved to $\Omega(\epsilon^{-2} \log(\epsilon^{2} mM))$ in [9].

The notation X ~ D means that the random variable X has probability distribution D. The 
term i.i.d. stands for independent and identically distributed family of random variables. 

Indyk's estimator. The use of $p$-stable sketches was pioneered by Indyk [8] for estimating $F_p$ for $0 < p \le 2$. A $p$-stable sketch is a linear combination $X = \sum_{i=1}^{n} f_i s_i$ where the $s_i$'s are drawn independently from the $p$-stable distribution $St(p, 1)$ with scale factor 1. By a property of stable distributions, $X \sim St(p, (F_p(\sigma))^{1/p})$. For estimating $F_1$, keep $t = O(\frac{1}{\epsilon^2})$ independent 1-stable (i.e., Cauchy) sketches $X_1, X_2, \ldots, X_t$ and let $\hat{F}_1$ be the suitably normalized median of $|X_1|, |X_2|, \ldots, |X_t|$. Then, $\hat{F}_1 \in (1 \pm \epsilon) F_1$ with probability 15/16. Further, Indyk shows that for stable distributions it suffices to (a) truncate the support of the distribution $St(p, 1)$ beyond $(mM)^{O(1)}$, and (b) approximate the continuous $St(p, 1)$ distribution by discretizing it into a grid with interval size $(mM/\epsilon)^{-O(1)}$.
To reduce the number of random bits required to maintain independent sketches, Nisan's pseudo-random generator (PRG) [11] is used for fooling space-$S$ bounded randomized machine computation; here $S = O(\epsilon^{-2} \log(\epsilon^{-1} mM))$. We can assume that the stream is ordered since the sketches are linear and therefore their values are independent of the order of item arrivals. For each element $i$, the stable random variables $s_i(u)$ for $u = 1, 2, \ldots, t$ are computed from the $i$th chunk of $S$ random bits obtained from Nisan's generator, which stretches a seed of length $S \log n$ to $nS$ bits. The time taken to obtain the $i$th random bit chunk is $O(\epsilon^{-2} \log(\epsilon^{-1})(\log n))$ simple field operations on a field of size $O(mM\epsilon^{-1})$. Kane, Nelson and Woodruff [9] observe that a seed length
of $O(\log(mM/\epsilon) \log(n))$ suffices.
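
The median estimator for $F_1$ can be illustrated with a small offline simulation. The sketch below assumes fully independent standard Cauchy variables drawn from Python's `random` module (no Nisan PRG, no truncation or discretization), and takes the whole frequency vector as input instead of a stream; for the standard Cauchy distribution the median of $|X|$ equals the scale, so no further normalization is applied. All names are illustrative.

```python
import math
import random
import statistics


def cauchy(rng):
    """Standard Cauchy (1-stable) sample via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def cauchy_sketches(freq, t, seed=0):
    """Return t sketches X_r = sum_i f_i * s_i(r) with independent Cauchy s_i(r)."""
    rng = random.Random(seed)
    sketches = [0.0] * t
    for f in freq:
        for r in range(t):
            sketches[r] += f * cauchy(rng)
    return sketches


def median_estimate(sketches):
    """Each X_r is Cauchy with scale F_1, so the median of |X_r| estimates F_1."""
    return statistics.median(abs(x) for x in sketches)
```

With $t$ sketches the relative error of the median decays roughly as $1/\sqrt{t}$, which is the source of the $t = O(1/\epsilon^2)$ requirement and of the $\Omega(\epsilon^{-2})$ per-update cost when every sketch is touched on every update.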

Li's estimator. Li [10] proposes several new estimators for the estimation of $F_p$ for $p \in (0, 2)$, most notably the geometric means estimator. These estimators are defined on $p$-stable sketches $X_u = \sum_{i \in [n]} f_i s_i(u)$, $u = 1, 2, \ldots, t$. The geometric means estimator is defined as

$$Y_{p,t} = C(p, p/t)^{-t} \prod_{i=1}^{t} |X_i|^{p/t}$$

where

$$C(p, q) = \frac{2}{\pi} \, \Gamma\left(1 - \frac{q}{p}\right) \Gamma(q) \sin\left(\frac{\pi}{2} q\right), \qquad -1 < q < p.$$

Li [10] proves that (i) the estimator is unbiased, that is, $E[Y_{p,t}] = F_p$, and (ii) $|Y_{p,t} - F_p| \le \epsilon F_p$ with probability 7/8 provided $t = \Omega(\epsilon^{-2})$.
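
The geometric means estimator can be written down directly from the two formulas above. The following is a minimal sketch, assuming the sketch values $X_1, \ldots, X_t$ are already available as a list and that $t \ge 2$ (for $t = 1$ the factor $\Gamma(1 - q/p)$ is undefined); the function names are illustrative.

```python
import math


def stable_moment(p, q):
    """C(p, q) = E|X|^q for X ~ St(p, 1), valid for 0 < q < p in this sketch."""
    return (2.0 / math.pi) * math.gamma(1.0 - q / p) * math.gamma(q) \
        * math.sin(math.pi * q / 2.0)


def geometric_means_estimate(sketches, p):
    """Y_{p,t} = C(p, p/t)^(-t) * prod_i |X_i|^(p/t), with t = len(sketches) >= 2."""
    t = len(sketches)
    normalizer = stable_moment(p, p / t) ** t
    product = 1.0
    for x in sketches:
        product *= abs(x) ** (p / t)
    return product / normalizer
```

Two closed forms give a quick sanity check of $C(p, q)$: for $p = 1$ it reduces to $1/\cos(\pi q/2)$, and for $p = 2$, $q = 1$ it equals $2/\sqrt{\pi}$, the mean absolute value of a normal variable with variance 2.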



Other work. Kane, Nelson and Woodruff [9] present algorithms for estimating $F_p$ for $p \in (0, 2)$ that use space that is tight with respect to the lower bounds. The update processing time is
0(e^^(loge^^)^/(logloge^^)) simple operations on fields of size (mM)^^^' . 

An estimator for $F_p$ based on the Hss technique was presented in [6]. Though it uses sub-optimal space $O(\epsilon^{-2-p}(\log(mM))^2(\log n))$, it has the best update processing time so
far, namely, 0(log (mM)). 

1.2 Contributions 

We present a novel algorithm for estimating $F_1$ that is nearly optimal with respect to both space and update-processing time. So far, all known algorithms, except the Hss based technique [6], have a per-update processing time of $\Omega(\epsilon^{-2})$. The Hss technique, however, is sub-optimal in space and requires space $O(\epsilon^{-3}(\log(mM))^2(\log n))$ for estimating $F_1$. The space requirement of our algorithm is $O(\epsilon^{-2} \log(n\epsilon^{-1}) \log(mM) + (\log n)(\log \epsilon^{-1}) \log(mM))$. The time for processing each stream update is $O((\log n)(\log \epsilon^{-1}))$ simple operations on $O(\log(mM))$-bit numbers.

2 Algorithm for estimating $F_1$

In this section, we present an algorithm for estimating $F_p$ that has fast update time. We first describe the data structure and then the estimator.

Notation. $F_2^{res}(k)$ is defined as follows. Let $|f_{s_1}| \ge |f_{s_2}| \ge \cdots \ge |f_{s_n}|$. Then $F_2^{res}(k) = \sum_{j=k+1}^{n} |f_{s_j}|^2$. Let $\bar{\epsilon}$ be the user-supplied accuracy parameter and set $\epsilon = \bar{\epsilon}/10$.

Stablesketch and Countsketch structure. The Stablesketch structure is a hash table $U$ having $C = 64B$ buckets numbered from 1 to $64B$, where $B = 1/\epsilon^2$, together with a hash function $h : [n] \to [C]$ that is chosen uniformly at random from a hash family $\mathcal{H}$ mapping $[n] \to [C]$. The degree of independence required of the hash family will be determined later; for now, it is assumed to be fully independent.

For $b \in [C]$, each bucket $U[b]$ of the table maintains three linear $p$-stable sketches, denoted by $X_{b,1}$, $X_{b,2}$ and $X_{b,3}$, as follows.

$$X_{b,r} = \sum_{i \in [n]:\, h(i) = b} f_i \, s_{b,r}(i), \qquad b \in [C],\ r \in \{1, 2, 3\}.$$

For each value of $b$ and $r$, the random variables $\{s_{b,r}(i)\}_{i \in [n]}$ are independent (this independence will be relaxed later). For each value of $b$, the seeds for the random variables $s_{b,r}(i)$ and $s_{b,r'}(i')$, for $r \ne r'$, are three-wise independent. Across buckets in the same table, the stable sketches need only be pair-wise independent, that is, the seeds for the random variables $s_{b,r}(i)$ and $s_{b',r'}(i')$, for $b \ne b'$, are pair-wise independent. Since the sketches are linear, each update $(i, v)$ is processed by setting $b = h(i)$ and $X_{b,r} \leftarrow X_{b,r} + v \cdot s_{b,r}(i)$ for $r \in \{1, 2, 3\}$.
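
The update path of the Stablesketch table touches a single bucket and three counters per record, which is what keeps the update cost independent of $\epsilon^{-2}$. The following minimal sketch illustrates this for $p = 1$. It is only an illustration under simplifying assumptions: the hash function and the Cauchy variables are derived from string-seeded instances of Python's `random.Random` (standing in for the limited-independence families and Nisan's PRG used in the paper), and the class and method names are hypothetical.

```python
import math
import random


class StableSketchTable:
    """Hash table U with C buckets, each holding three Cauchy sketches."""

    def __init__(self, num_buckets, seed=0):
        self.C = num_buckets
        self.seed = seed
        self.X = [[0.0, 0.0, 0.0] for _ in range(num_buckets)]

    def bucket(self, item):
        # Stand-in for h : [n] -> [C]; deterministic for a given (seed, item).
        return random.Random(f"h-{self.seed}-{item}").randrange(self.C)

    def stable(self, b, r, item):
        # Stand-in for s_{b,r}(i): a reproducible standard Cauchy value.
        u = random.Random(f"s-{self.seed}-{b}-{r}-{item}").random()
        return math.tan(math.pi * (u - 0.5))

    def update(self, item, change):
        # Process (i, v): only bucket h(i) and its three sketches change.
        b = self.bucket(item)
        for r in range(3):
            self.X[b][r] += change * self.stable(b, r, item)
```

Because the variables are regenerated from the seed on every call, an insertion followed by a matching deletion returns the affected counters to zero, as linearity requires.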

We keep a Countsketch structure [3] consisting of $g$ hash tables $T_1, T_2, \ldots, T_g$, where $g = O(\log \frac{1}{\epsilon})$ and each table consists of $C$ buckets. Later, the degree of independence is determined






and reduced. Heavy hitters are identified using (another) Countsketch structure, denoted as 
KHg", that can return an estimate fi of the frequency fi such that |/j — fi\ < 8( ^ ^ ) , with 
constant probability of success 127/128. We let this Countsketch structure have $O(\log n)$ independent hash tables and functions. The Countsketch data structures together use a total space of $O(\epsilon^{-2}(\log n + \log(\epsilon^{-1})))$ bits. The time taken to update this structure is $O(\log n + \log \epsilon^{-1})$.



2.1 Estimator 

Estimating $F_2^{res}$. The algorithm of [5] is applied to the $HH_2$ data structure to obtain estimates for $F_2^{res}(8B)$ and $F_2^{res}(B)$ that are accurate to factors of $1 \pm 1/128$ with prob. at least 127/128.

Heavy and light items. After estimating $F_2^{res}(B)$, we estimate the frequencies of all heavy-hitters. Items are classified according to their estimated frequencies into two categories as follows.

(0 heavy: ff > 1™ and (ii) ligh. /.'^ < 'JTm . ,i) 

The sets of heavy and light items are denoted respectively as $H$ and $L$. The algorithm obtains separate estimates for the contribution to $F_p$ from the heavy items and the light items, and adds them to obtain the final estimate. That is,

$$\hat{F}_p = \hat{F}_p^H + \hat{F}_p^L.$$

Notation. For any set $R \subset [n]$, let $F_p(R)$ denote $\sum_{i \in R} |f_i|^p$.

The true contributions of the items in $H$ and $L$ are as follows: $F_p^H = F_p(H)$, $F_p^L = F_p(L)$.

Heavy estimator. We identify the set $H$ of heavy items as those elements whose estimated frequencies satisfy (1)(i). Say that the event NoHvyColl($i$) holds if there is some table index $j \in [g]$ such that no other heavy item maps to the same bucket as $h_j(i)$. That is,

$$\mathrm{NoHvyColl}(i) \equiv \exists j \in [g] \text{ s.t. } \forall k \in H \setminus \{i\},\ h_j(i) \ne h_j(k).$$

If NoHvyColl($i$) holds, then let $\theta(i)$ denote an index $j \in [g]$ such that $i$ is isolated from all other heavy items in its bucket in table $T_j$.

For $i \in H$ we obtain an estimate as follows. If NoHvyColl($i$) holds, then $\theta(i)$ exists; let $b = h_{\theta(i)}(i)$ be the bucket to which $i$ maps under $h_{\theta(i)}$. Also, let $\xi_j$ be the AMS 4-wise independent hash function mapping items to $\{1, -1\}$ corresponding to table $T_j$. The estimate is obtained as

$$Y_i = \begin{cases} T_j[b] \cdot \mathrm{sgn}(\hat{f}_i) \cdot \xi_j(i) & \text{if NoHvyColl}(i) \text{ holds, where } j = \theta(i),\ b = h_j(i), \\ 0 & \text{otherwise.} \end{cases}$$

The heavy estimate is: $\hat{F}_1^H = \sum_{i \in H} Y_i$.
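
The heavy estimator can be illustrated with a toy Countsketch. The sketch below is a simplified illustration and not the paper's implementation: the bucket hashes $h_j$ and the sign hashes $\xi_j$ are simulated with string-seeded `random.Random` instances instead of limited-independence families, the set of heavy items and their sign estimates are passed in directly instead of being recovered from the $HH_2$ structure, and all names are hypothetical.

```python
import random


class CountSketch:
    """g tables of C signed counters: T_j[h_j(i)] += v * xi_j(i)."""

    def __init__(self, g, C, seed=0):
        self.g, self.C, self.seed = g, C, seed
        self.T = [[0] * C for _ in range(g)]

    def h(self, j, item):
        return random.Random(f"h-{self.seed}-{j}-{item}").randrange(self.C)

    def xi(self, j, item):
        return random.Random(f"xi-{self.seed}-{j}-{item}").choice((-1, 1))

    def update(self, item, change):
        for j in range(self.g):
            self.T[j][self.h(j, item)] += change * self.xi(j, item)


def heavy_estimate(cs, heavy, sign_hint):
    """Sum of Y_i over heavy items; Y_i = 0 if no table isolates item i."""
    total = 0
    for i in heavy:
        for j in range(cs.g):
            b = cs.h(j, i)
            # Table j isolates i if no other heavy item shares bucket b.
            if all(cs.h(j, k) != b for k in heavy if k != i):
                total += cs.T[j][b] * sign_hint[i] * cs.xi(j, i)
                break
    return total
```

When the stream contains only heavy items and each of them is isolated in some table, every $Y_i$ equals $|f_i|$ exactly; light items sharing the bucket add zero-mean noise, which is what the variance analysis of Section 3.2 controls.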



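Light Estimator. For a bucket index $b \in [C]$, say that the event NoCollision($b$) holds if no heavy item maps to bucket $b$ in table $U$. That is,

$$\mathrm{NoCollision}(b) \equiv \forall k \in H,\ h(k) \ne b.$$

The estimate returned is

$$\hat{F}_p^L = C_L \sum_{b \in \mathcal{B}} (C(p, p/3))^{-3} \prod_{r=1}^{3} |X_{b,r}|^{p/3},$$

where $\mathcal{B}$ is the set of buckets $b$ for which NoCollision($b$) holds and $C_L = 1/\Pr[\mathrm{NoCollision}(b)] = (1 - 1/C)^{-|H|}$.

The final estimator is the sum of the heavy and light estimators, namely, $\hat{F}_1 = \hat{F}_1^H + \hat{F}_1^L$.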
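
The light estimator is a per-bucket geometric means estimator with $t = 3$, summed over the buckets that contain no heavy item and rescaled by $C_L$. A minimal sketch follows, assuming the bucket sketches, the collision-free bucket indices and $|H|$ are given as inputs and that full independence of $h$ holds, so that $C_L = (1 - 1/C)^{-|H|}$; the function names are illustrative.

```python
import math


def stable_moment(p, q):
    """C(p, q) = (2/pi) * Gamma(1 - q/p) * Gamma(q) * sin(pi*q/2)."""
    return (2.0 / math.pi) * math.gamma(1.0 - q / p) * math.gamma(q) \
        * math.sin(math.pi * q / 2.0)


def light_estimate(bucket_sketches, free_buckets, num_heavy, p=1.0):
    """C_L * sum over collision-free buckets of the 3-sketch geometric mean."""
    C = len(bucket_sketches)
    c_l = (1.0 - 1.0 / C) ** (-num_heavy)
    normalizer = stable_moment(p, p / 3.0) ** 3
    total = 0.0
    for b in free_buckets:
        x1, x2, x3 = bucket_sketches[b]
        total += (abs(x1) * abs(x2) * abs(x3)) ** (p / 3.0) / normalizer
    return c_l * total
```

The factor $C_L$ compensates for the light items that are discarded because they share a bucket with a heavy item.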

3 Analysis 

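Throughout this section, we will assume that $\epsilon \le 1/8$, $B = \epsilon^{-2}$ and $C = 64B$.
Claim 1 $|H| \le 5.1B$ with probability 127/128.

Proof See Appendix A.

The following lemma is standard from arguments in tail bounds of frequency powers.

Lemma 3.1 Suppose $|f_{s_1}| \ge |f_{s_2}| \ge \cdots \ge |f_{s_n}|$. Then, for any $0 < p < q$,

$$\sum_{j=B+1}^{n} |f_{s_j}|^q \le \frac{1}{B^{q/p - 1}} \left( \sum_{j=1}^{n} |f_{s_j}|^p \right)^{q/p}.$$

In particular, for $q = 2p$, $\sum_{j=B+1}^{n} |f_{s_j}|^{2p} \le \frac{1}{B} \left( \sum_{j=1}^{n} |f_{s_j}|^p \right)^2$.
Proof See Appendix A.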
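
The tail bound of Lemma 3.1 is easy to check numerically on small vectors. The following sketch evaluates both sides of the inequality; the function name and the sample vector are illustrative assumptions.

```python
def tail_bound(freq, B, p, q):
    """Return (tail, bound) for Lemma 3.1 with 0 < p < q.

    tail  = sum of |f|^q over all but the B largest items,
    bound = (sum of |f|^p over all items)^(q/p) / B^(q/p - 1).
    """
    mags = sorted((abs(f) for f in freq), reverse=True)
    tail = sum(m ** q for m in mags[B:])
    bound = sum(m ** p for m in mags) ** (q / p) / B ** (q / p - 1)
    return tail, bound


# Hypothetical frequency vector; with p = 1, q = 2 and B = 2 the tail is 65
# and the bound is 33^2 / 2 = 544.5.
tail, bound = tail_bound([9, -7, 5, 5, 3, -2, 1, 1], B=2, p=1, q=2)
```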

3.1 Analysis of Light Estimator 

The light estimator $\hat{F}_p^L$ is analyzed in the general setting when $p \in (0, 2)$.

Let $\mathcal{B}$ be the set of buckets in table $U$ such that no element of $H$ maps to any of these buckets, that is, $\mathcal{B} = \{ b \in [C] \mid \forall i \in H,\ h(i) \ne b \}$.



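Lemma 3.2 $E_{h,s}\left[\hat{F}_p^L\right] = F_p^L$.

Proof Fix $h$ and a bucket $b \in \mathcal{B}$. Each sketch $X_{b,r}$ is distributed as $St\left(p, \left(\sum_{i: h(i) = b} |f_i|^p\right)^{1/p}\right)$, so that $E_s\left[|X_{b,r}|^{p/3}\right] = C(p, p/3) \left(\sum_{i: h(i) = b} |f_i|^p\right)^{1/3}$. Using the independence of the three sketches in a bucket and summing over the buckets in $\mathcal{B}$,

$$E_{h,s}\left[\hat{F}_p^L\right] = C_L \, E_h\left[ \sum_{b \in \mathcal{B}} \sum_{i: h(i) = b} |f_i|^p \right] = C_L \sum_{i \in L} |f_i|^p \cdot \Pr[\mathrm{NoCollision}(h(i))] = F_p^L. \qquad \blacksquare$$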






Define

$$K_p = (C(p, p/3))^{-6} \, (C(p, 2p/3))^{3}, \quad \text{where } C(p, q) = \frac{2}{\pi} \, \Gamma\left(1 - \frac{q}{p}\right) \Gamma(q) \sin\left(\frac{\pi}{2} q\right).$$

As shown by Li [10], $K_p \le (\pi^2/36)(p^2 + 2) + 1 \le 2.5$.

Random variables such as $\hat{F}_p^L$ are functions of two independent sets of random bits, namely, the hash function $h$ and the bits used by the stable variables, denoted as $s$. To make this dependence explicit, we write $Var_{h,s}[\hat{F}_p^L]$ and $E_{h,s}[\hat{F}_p^L]$ for the variance and expectation of $\hat{F}_p^L$ (or any suitable random variable) over the random seeds of $h$ and $s$. The notation $E_s[\hat{F}_p^L]$ is used to emphasize that the expectation is taken over the random bits of $s$, holding the random bits of $h$ fixed. In effect this is the same as $E[\hat{F}_p^L \mid h]$. Therefore, $E_{h,s}[\hat{F}_p^L] = E_h[E_s[\hat{F}_p^L]]$, since the random bits used by $h$ and $s$ are independent.



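Lemma 3.3 $Var_{h,s}\left[\hat{F}_p^L\right] \le (K_p C_L - 1) \sum_{i \in L} |f_i|^{2p} + \frac{K_p C_L}{C} \left( \sum_{i \in L} |f_i|^p \right)^2$.

Proof of Lemma 3.3 Denote the estimate of bucket $b \in \mathcal{B}$ obtained from the light estimator by $Y_b = C_L (C(p, p/3))^{-3} |X_{b,1}|^{p/3} |X_{b,2}|^{p/3} |X_{b,3}|^{p/3}$. Then,

$$Y = \hat{F}_p^L = \sum_{b \in \mathcal{B}} Y_b = C_L (C(p, p/3))^{-3} \sum_{b \in \mathcal{B}} \prod_{r=1}^{3} |X_{b,r}|^{p/3}. \qquad (4)$$

Let $1/C_L$ be the probability that an item $i \in L$ does not conflict with any item in $H$ under the hash function $h$. Under full independence of $h$, $C_L = (1 - 1/C)^{-|H|}$. We write NoCollision($i$) for the event NoCollision($h(i)$).
We have,

$$E_{h,s}\left[Y^2\right] = E_{h,s}\left[ \sum_{b \in \mathcal{B}} Y_b^2 \right] + E_{h,s}\left[ \sum_{b, b' \in \mathcal{B},\ b \ne b'} Y_b Y_{b'} \right]. \qquad (5)$$

Let $b \in \mathcal{B}$ and $K_p = (C(p, p/3))^{-6} (C(p, 2p/3))^{3}$. Then,

$$\begin{aligned}
E_{h,s}\left[ \sum_{b \in \mathcal{B}} Y_b^2 \right]
&= K_p C_L^2 \, E_h\left[ \sum_{b \in \mathcal{B}(h)} \left( \sum_{i: h(i) = b} |f_i|^p \right)^2 \right] \\
&= K_p C_L^2 \, E_h\left[ \sum_{b \in \mathcal{B}(h)} \left( \sum_{i: h(i) = b} |f_i|^{2p} + \sum_{i \ne i',\ h(i) = h(i') = b} |f_i f_{i'}|^p \right) \right] \\
&= K_p C_L^2 \sum_{i \in L} |f_i|^{2p} \cdot \Pr[\mathrm{NoCollision}(i)] + K_p C_L^2 \sum_{i \ne i' \in L} |f_i f_{i'}|^p \cdot \Pr[h(i) = h(i'), \mathrm{NoCollision}(i)] \\
&\le K_p C_L \sum_{i \in L} |f_i|^{2p} + \frac{K_p C_L}{C} \left( \sum_{i \in L} |f_i|^p \right)^2. \qquad (6)
\end{aligned}$$

Further, for $b \ne b'$ and $b, b' \in \mathcal{B}$,

$$\begin{aligned}
E_{h,s}\left[ \sum_{b \ne b'} Y_b Y_{b'} \right]
&= E_h\left[ \sum_{b \ne b'} E_s[Y_b \mid h] \, E_s[Y_{b'} \mid h] \right], \quad \text{since } b \ne b' \text{ and } s \text{ is independent of } h, \\
&= C_L^2 \sum_{i \ne i' \in L} |f_i|^p |f_{i'}|^p \cdot \Pr[h(i) \ne h(i'), \mathrm{NoCollision}(i), \mathrm{NoCollision}(i')] \\
&\le \left( \sum_{i \in L} |f_i|^p \right)^2 - \sum_{i \in L} |f_i|^{2p}, \qquad (7)
\end{aligned}$$

since $\Pr[h(i) \ne h(i'), \mathrm{NoCollision}(i), \mathrm{NoCollision}(i')] = (1 - 1/C)(1 - 2/C)^{|H|} \le (1 - 1/C) C_L^{-2}$.
Substituting (6) and (7) into (5), we get

$$Var_{h,s}[Y] = E_{h,s}\left[Y^2\right] - \left( \sum_{i \in L} |f_i|^p \right)^2 \le (K_p C_L - 1) \sum_{i \in L} |f_i|^{2p} + \frac{K_p C_L}{C} \left( \sum_{i \in L} |f_i|^p \right)^2. \qquad \blacksquare$$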



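Lemma 3.4 $|\hat{F}_p^L - F_p^L| \le 6 \left( 1.75/8^{1-p/2} + 2.75/64 \right)^{1/2} \epsilon F_p$ with prob. 35/36.
Proof

$$\sum_{i \in L} |f_i|^{2p} \le \left( \max_{i \in L} |f_i| \right)^p \sum_{i \in L} |f_i|^p \le \left( \frac{F_2^{res}(8B)}{B} \right)^{p/2} F_p \le \frac{F_p^2}{B^{p/2} (8B)^{1-p/2}} = \frac{\epsilon^2 F_p^2}{8^{1-p/2}} \qquad (8)$$

since $B = 1/\epsilon^2$. Further, $K_p \le (\pi^2/36)(p^2 + 2) + 1 \le 2.5$ and $C_L \le (1 - |H|/C)^{-1} \le (1 - 5.1B/64B)^{-1} \le 1.1$ by Claim 1. Therefore, by Lemma 3.3 and (8), we have

$$Var\left[\hat{F}_p^L\right] \le (K_p C_L - 1) \sum_{i \in L} |f_i|^{2p} + \frac{K_p C_L}{C} \left( \sum_{i \in L} |f_i|^p \right)^2 \le \left( 1.75/8^{1-p/2} + 2.75/64 \right) \epsilon^2 F_p^2.$$

By Chebyshev's inequality,

$$\Pr\left[ |\hat{F}_p^L - F_p^L| > 6 \left( 1.75/8^{1-p/2} + 2.75/64 \right)^{1/2} \epsilon F_p \right] \le \frac{1}{36}. \qquad \blacksquare$$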
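
For $p = 1$ the constant in Lemma 3.4 can be evaluated directly, which makes the later step from this lemma to a $6\epsilon F_1$ error bound easy to verify. The helper below is an illustrative assumption that simply evaluates the expression $6(1.75/8^{1-p/2} + 2.75/64)^{1/2}$ as stated in the lemma.

```python
import math


def light_error_factor(p):
    """Coefficient of eps * F_p in Lemma 3.4: 6 * sqrt(1.75 / 8^(1 - p/2) + 2.75 / 64)."""
    return 6.0 * math.sqrt(1.75 / 8 ** (1 - p / 2) + 2.75 / 64)


factor_p1 = light_error_factor(1)  # approximately 4.88, which is below 6
```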



3.2 Analysis of Heavy Estimator 

In this section, we analyze the heavy estimator for estimating $F_1^H$.

For any set $K \subset [n]$, let $F_2^{res}(K) = F_2 - F_2(K) = \sum_{i \notin K} |f_i|^2$. The following lemma is from [5].

Lemma 3.5 Let $K$ be the items that are top-$k$ with respect to estimated absolute frequencies using the Countsketch algorithm with table height $64B$. Let $|K| = k$ and let Top-k($k$) be the
indices of the top-k items of f w.r.t. absolute frequencies. Lf k < 8B, then, F2^^{k) < F2'^^{K) < 
F^^'{k){l + 2Vk + k). 



Proof of Lemma 13.51 

= E i/^i'+ E f? 

i^(T0P-K(k)UK) ieTop-K(k)\K 

< E i/^i'+ E if^ + ^f 

i^(TOP-K(k)UK) ieK\Top-K{k) 

< E l/^l'+ E /' + 2A E l/d + IMTOP-K(fc)|A2 

i^{TOP-K{k)UK) jgX\Top-K(fc) igJ<'\Top-K(fc) 



\e_ft:\Top-K(fc) 



(|K\Top-K(A;)|Ff'^(8B)) 



1/2 



,1/2 



< F^'^'ik) + 2^ — ^ B — {Fnk))''" + A;Ff ^(85) 

< Ff '^(/fc) + 2VkF^^'{k) + A;F|^'^(/t) | 

For a heavy item $i \in H$, let NoHvyColl($i$) be the event that $i$ does not collide with any of the other heavy items in its bucket in at least one of the Countsketch tables $T_1, \ldots, T_g$, that is,

$$\mathrm{NoHvyColl}(i) \equiv \exists r \in [g] \text{ s.t. } \forall k \in H \setminus \{i\},\ h_r(k) \ne h_r(i).$$

The event NoHvyColl($H$) is said to occur if NoHvyColl($i$) occurs for each $i \in H$. That is,

$$\mathrm{NoHvyColl}(H) \equiv \forall i \in H,\ \mathrm{NoHvyColl}(i) \text{ holds}.$$

Assuming full independence, $\Pr[\mathrm{NoHvyColl}(H)] \ge 1 - |H| \left( 1 - (1 - \frac{1}{C})^{|H|-1} \right)^g$. Since $|H| \le 5.1B$ and $C = 64B$, we have $\Pr[\mathrm{NoHvyColl}(H)] \ge \frac{31}{32}$ if $g \ge \frac{\log(32|H|)}{\log(C/(2(|H|-1)))}$. Since $|H| \le 5.1B$, it suffices



to let gf = 2 + log 



5.1 



If NoHvyColl($i$) holds, then let $j = \theta(i)$ be the index of (some) $r \in [g]$ such that $i$ has no collision with any item of $H$ (except itself) under the hash function $h_j$. Let $T = T_{\theta(i)}$, $h = h_{\theta(i)}$ and $\xi = \xi_{\theta(i)}$. Then, let

Yi = CH ■ Te(i)[/ie(j)(i)] • sgn(/i) • ^e{i){i) ■ 

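Although we do not know $\mathrm{sgn}(f_i)$, we can use $\mathrm{sgn}(\hat{f}_i)$ instead, which is equal to it with very high probability.

Lemma 3.6 For $i \in H$, $E[Y_i] = |f_i|$. Thus, $E\left[\sum_{i \in H} Y_i\right] = F_1^H$.

Proof

$$E_\xi[Y_i \mid \mathrm{NoHvyColl}(H)] = E\left[ f_i \cdot \mathrm{sgn}(f_i) \cdot \xi(i)^2 + \sum_{k: h(k) = h(i),\ k \ne i} f_k \, \xi(k) \, \xi(i) \, \mathrm{sgn}(f_i) \right] = f_i \cdot \mathrm{sgn}(f_i) = |f_i|. \qquad \blacksquare$$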

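Lemma 3.7 Let $i, k \in H$, $i \ne k$. Then, $E[Y_i Y_k \mid \mathrm{NoHvyColl}(H)] = |f_i||f_k|$.

Proof Let $i \ne j$ and consider $Y_i Y_j$. Assume that NoHvyColl($H$) holds. Then,

$$Y_i Y_j = \left( T_{\theta(i)}[h_{\theta(i)}(i)] \cdot \mathrm{sgn}(f_i) \cdot \xi_{\theta(i)}(i) \right) \cdot \left( T_{\theta(j)}[h_{\theta(j)}(j)] \cdot \mathrm{sgn}(f_j) \cdot \xi_{\theta(j)}(j) \right).$$

There are two cases, namely, either (i) $\theta(i) = \theta(j)$ or (ii) $\theta(i) \ne \theta(j)$.
If $t = \theta(i) \ne \theta(j) = t'$, then,

$$Y_i Y_j = \left( \mathrm{sgn}(f_i) \sum_{i': h_t(i') = h_t(i)} f_{i'} \, \xi_t(i') \, \xi_t(i) \right) \cdot \left( \mathrm{sgn}(f_j) \sum_{j': h_{t'}(j') = h_{t'}(j)} f_{j'} \, \xi_{t'}(j') \, \xi_{t'}(j) \right).$$

Since $t \ne t'$, the two multiplicands use independent random bits, since the $\xi_t$'s are independent of the $\xi_{t'}$'s. Hence, the expectation of the product is the product of the expectations, the conditioning on NoHvyColl($H$) notwithstanding. Therefore,

$$E[Y_i Y_j \mid \mathrm{NoHvyColl}(H) \text{ and } \theta(i) \ne \theta(j)] = |f_i||f_j|.$$

Otherwise, let $t = \theta(i) = \theta(j)$. Then,

$$Y_i Y_j = \left( T_t[h_t(i)] \cdot \mathrm{sgn}(f_i) \cdot \xi_t(i) \right) \cdot \left( T_t[h_t(j)] \cdot \mathrm{sgn}(f_j) \cdot \xi_t(j) \right) = \mathrm{sgn}(f_i f_j) \sum_{\substack{i': h_t(i') = h_t(i) \\ j': h_t(j') = h_t(j)}} f_{i'} f_{j'} \, \xi_t(i) \, \xi_t(j) \, \xi_t(i') \, \xi_t(j').$$

Note that since NoHvyColl($H$) holds, $h_t(i) \ne h_t(j)$ and therefore $i' \ne j'$. Taking expectations and using four-wise independence of the $\xi$'s, we obtain

$$E[Y_i Y_j \mid \mathrm{NoHvyColl}(H) \text{ and } \theta(i) = \theta(j)] = |f_i||f_j|.$$

Therefore, in all cases, we have

$$E[Y_i Y_j \mid \mathrm{NoHvyColl}(H)] = |f_i||f_j|, \qquad i \ne j,\ i, j \in H. \qquad \blacksquare \qquad (9)$$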

Lemma 3.8 Ife<\,B = l/e^, C = 64B and g = log ^, then Pr [|F/^ - FI^\ < eFi 



> 2 



3- 



Proof Let NoHvyColl be an abbreviation for the event NoHvyColl($H$). Let $|H| = m'$.



Var^ 



^ Yi I NoHvyColl 



i&H 



^(E^ [Y^ I NoHvyColl] - (Eg [Y.-, \ NoHvyColl])^ 



i&H 



+ ^ (Eg [YiXj I NoHvyColl] - Eg [y^ | NoHvyColl] Eg [Yj \ NoHvyColl] 
Yl Yl fk+O, (by Lemma EI 

i£H k:hg^i)(k)=hg,^i){i) 



kj^i,k^H 



Therefore Var^^^g [Eieiif ^* I NoHvyColl] = ^-^F^''''{H) . As in ^, define the event 



LowVar = Varg 



Y Yi I NoHVYCOLL(i7) 



ie-ff 



< 



C 



By Markov's inequality, $\Pr_h[\mathrm{LowVar} \mid \mathrm{NoHvyColl}] \ge \frac{7}{8}$. By Chebyshev's inequality,

,_^ ,_^ /I f7l ??res('8R>i\ 1/2 



Pr 



< 



i€H i€H 

Unconditioning dependencies 



|i:f|Ff« 



c^) 



NoHvyColl and LowVar 



7 
> - 



Pr 



ieH i&H 



< 



165 



1/2- 



7. 



> -Pr [LowVar | NoHvyColl] Pr [NoHvyColl] 

(10) 



7 7 31 2 
> _ . > _ 

- 8 8 32-3 



Recah that \H\ < 5.1B and by LemmaESJ Ff '(F) < 12.04F|^"(|i7|). Therefore, 



|iJ|Ff'(i7) 
64B 



1/2 



< 



12.04|i/|Ff'^(|F| 
646 



^/^ /UMlHlV^^^ 0.44^ , ^ 



6AB H 



Substituting in (fTO]l . we have Pr 



Z/ieH^i zZieHlf-' 



< 3.6eFi 



> 



3- 



3.3 Total Error 

In this section, we add the various errors to obtain the total error of the estimate.
Theorem 3.9 $|\hat{F}_1 - F_1| \le 10\epsilon F_1$ with prob. 0.576.

Proof From the analysis of the light estimator (Lemma 3.4, setting $p = 1$) we have

$|\hat{F}_1^L - F_1^L| \le 6\epsilon F_1$ with probability 35/36.

By the heavy estimator (Lemma 3.8) we have

$|\hat{F}_1^H - F_1^H| \le 3.6\epsilon F_1$ with prob. $\frac{2}{3}$.

Since $\hat{F}_1 = \hat{F}_1^H + \hat{F}_1^L$ and $F_1 = F_1^H + F_1^L$, we have

$|\hat{F}_1 - F_1| \le 10\epsilon F_1$



with prob. 1 



1 1 



32 36 3 



0.576. 



3.4 Reducing Random Bits 

We now reduce the randomness requirements for the stable sketches and the hash functions.
Stable Sketches. A single stable sketch used in a bucket of table $U$ may be fooled using Nisan's PRG with a seed of length $T = O((\log \frac{mM}{\epsilon})(\log n))$ bits. The three stable sketches in each bucket need to be only 3-wise independent. The stable sketches used across the buckets of table $U$ need to be only pair-wise independent to facilitate variance calculations. Thus, it suffices to use a pair-wise independent hash function $g$ that maps $3T$-bit strings to $3T$-bit strings. The seeds for the buckets are obtained as $g(1), g(2), \ldots, g(C)$. Each $3T$-bit string is viewed as the seed for a 3-wise independent hash function. The number of random bits used per table is $3T$. The seeds for stable sketches across the tables are pair-wise independent, since the random variables are used only in variance calculations. Hence we can use a random seed of length $O(T) = O((\log \frac{mM}{\epsilon})(\log n))$.
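
The paper does not fix a particular pair-wise independent family for $g$. One standard choice, assumed here purely for illustration, is the affine family $g_{a,b}(x) = (ax + b) \bmod P$ over a prime field, with $a$ and $b$ drawn uniformly from $\{0, \ldots, P-1\}$. The sketch below verifies pair-wise independence by brute force for a small prime: for distinct inputs $x \ne y$, every pair of output values is produced by exactly one choice of $(a, b)$.

```python
from itertools import product


def affine_hash(a, b, prime):
    """Member g_{a,b}(x) = (a*x + b) mod prime of the affine family."""
    return lambda x: (a * x + b) % prime


def is_pairwise_independent(prime, x, y):
    """Check that (g(x), g(y)) is uniform over all pairs as (a, b) ranges over the family."""
    counts = {}
    for a, b in product(range(prime), repeat=2):
        g = affine_hash(a, b, prime)
        key = (g(x), g(y))
        counts[key] = counts.get(key, 0) + 1
    return len(counts) == prime * prime and all(c == 1 for c in counts.values())
```

In this family a function is described by the two field elements $a$ and $b$, so hashing $3T$-bit strings would take a prime slightly larger than $2^{3T}$ and $O(T)$ seed bits, in line with the seed length stated above.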

Independence of hash functions. There are two occasions where full independence properties of hash functions are used, namely, (i) for $i \in H$, $1/C_L = \Pr[\mathrm{NoCollision}(i)]$ is estimated as $(1 - 1/C)^{|H|-1}$, and (ii) $\Pr[\mathrm{NoHvyColl}(H)] \ge \frac{31}{32}$. Let $\Pr[\cdot]$ denote the probability of an event under full independence of $h$ and let $\Pr_t[\cdot]$ denote the probability assuming the hash function is $t$-wise independent. Let $C_L^t$ denote $1/\Pr_t[\mathrm{NoCollision}(i)]$.

Lemma 3.10 If t = log -^, then, for any i £ H 

\Cl'^ - (Ciy^l = |Prt[NoCOLLSION(i)] - Pr[NoCOLLSiON(i)]| < e^ . 
If $t \ge 8$ and $g \ge 3 + \log(\epsilon^{-2})$, then $\Pr_t[\mathrm{NoHvyColl}] \ge 31/32$.
Proof See Appendix A.

3.5 Space and Update time 

In this section we summarize the resource consumption of the algorithm. 

The space requirement is $O(\epsilon^{-2}(\log n + \log \frac{1}{\epsilon}) \log(mM))$. The length of the random seed is $O((\log \frac{mM}{\epsilon})(\log n) + \log \frac{1}{\epsilon})$ and does not dominate the space requirement. The time taken to update the $HH_2$ structure is $O(\log n)$. The hash tables $T_j$ and $U$ use 8-wise independent hash
functions (except Ti and Ui that use 0(logi)-wise independence for estimating F^). The hash 
values are calculated in time $O(g) = O(\log \frac{1}{\epsilon})$. Nisan's PRG is used to generate a chunk of size $3T$ bits using $O((\log n)(\log \frac{1}{\epsilon}))$ operations on words of size $O(\log(mM))$ bits. The total update time is $O((\log n)(\log \frac{1}{\epsilon}) + \log \frac{1}{\epsilon} + \log n) = O((\log n)(\log \epsilon^{-1}))$.

4 Conclusion 

We first present a novel space-optimal algorithm for estimating $F_p$ over data streams to within a multiplicative error factor of $1 \pm \epsilon$ for $p \in (0, 2]$. We then present an algorithm for estimating $F_1$. This algorithm is nearly optimal with respect to both space usage and update processing time. The space requirement of the algorithm is $O(\epsilon^{-2} \log(n\epsilon^{-1}) \log(mM))$ and its per-update processing time is $O((\log n)(\log \epsilon^{-1}))$.
A Proofs

Proof of Claim [U fi£ fi± (S!^)!/^ = a (say). For i e H, 

1/2 / 7?rcs/'DN\ 1/2 / prcs/-,R\\ 1/2 



Therefore,

$$|H| \le 4B + (1.04)^2 B \le 5.1B. \qquad \blacksquare \qquad (11)$$

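Proof of Lemma 3.1 Divide the items into consecutive groups $G_1, G_2, \ldots, G_{\lceil n/B \rceil}$ of $B$ items each, that is, $G_1$ contains the first $B$ items in non-increasing order of absolute frequency values, $G_2$ contains the next $B$ items, and so on. The last group may contain fewer than $B$ items. Let $q > p$.

$$\begin{aligned}
\sum_{j=B+1}^{n} |f_{s_j}|^q
&= \sum_{l=2}^{\lceil n/B \rceil} \sum_{i \in G_l} |f_i|^q \\
&\le \sum_{l=2}^{\lceil n/B \rceil} \sum_{i \in G_l} \left( \frac{1}{B} \sum_{i' \in G_{l-1}} |f_{i'}|^p \right)^{q/p}, \quad \text{since for } i \in G_l,\ |f_i|^p \le \mathrm{avg}\{ |f_j|^p : j \in G_{l-1} \},\ p > 0, \\
&\le \sum_{l=1}^{\lceil n/B \rceil} \frac{1}{B^{q/p - 1}} \left( \sum_{i \in G_l} |f_i|^p \right)^{q/p} \le \frac{1}{B^{q/p - 1}} \left( \sum_{i=1}^{n} |f_i|^p \right)^{q/p}, \quad \text{since } q > p.
\end{aligned}$$

The particular case is obtained by setting $q = 2p$ in the above equation. $\blacksquare$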

Proof of Lemma 3.10 Fix a table index $j \in [g]$ and an item $i \in H$. Let $k \in H$, $k \ne i$. Define the indicator variable $X_k$ to be 1 if $k$ collides with $i$ in the same bucket in table $U$, that is, $h_j(i) = h_j(k)$. Let $Y = \sum_{k \in H,\ k \ne i} X_k$. The event NoCollision($i$) is equivalent to $Y = 0$. Let

/^ = E[y] = ^<il<o.L 

By Theorem 2.6, part (III) of [12j (proved using inclusion-exclusion), if t > e^-|-ln(l/Pr \Y = 0])-|- 
1 + D, then, 

I Pr* [y > 1] - Pr [y > 1] I < (1 - Pr [Y > 0] e"^ . 

We have, Pr[y = 0] = (l-l/C)l^l-i < 2{\H\-l)/C < 1/5. Therefore, for t > 0.1e-Mn(5) -h 1 + 1) 

|Prt[y = 0] - Prt[y = 0]| = |Prt[y > 1] -Pr[y > 1]| < {A/5)e-^ . 

It suffices for the RHS to be e^, which can be satisfied by keeping D = log(l/e^). 
For t > 8, 

Prt [NoHvyColl] > 1 - \H\(i - (1 - Pr^ [NoCOLLSiON(i)]y) 

> 1 - |F| [1 - [1 - Pr [NoCOLLSiON(i)] - (4/5)e"^^ ^ 

>^ 
- 32 

provided, g > log(5.1i?) > 3 + log(e^^). 